Abstract

We study the interpretability of conditional probability
estimates for binary classification in the agnostic setting.
In this setting, conditional probability estimates do not
necessarily reflect the true conditional probabilities.
Instead, they have a certain calibration property: among
all data points for which the classifier has predicted
$\mathcal{P}(Y = 1 \mid X) = p$, a $p$ portion of them
actually have label $Y = 1$. For cost-sensitive decision
problems, this calibration property provides adequate
support for using Bayes decision theory. In this paper, we
define a novel measure of the calibration property together
with its empirical counterpart, and prove a uniform
convergence result between them. This new measure enables
us to formally justify the calibration property of
conditional probability estimates, and provides new
insights into the problem of estimating and calibrating
conditional probabilities.

1 Introduction 

Many binary classification algorithms, such as naive
Bayes and logistic regression, naturally produce confidence
measures in the form of conditional probabilities of
labels. These confidence measures are usually interpreted
as the conditional probability of the label $y = 1$ given
the feature $x$. An important research question is how to
justify these conditional probabilities, i.e., how to prove
the trustworthiness of such results.

In classical statistics, this question is usually studied
under the realizable assumption, which assumes that the
true underlying probability distribution has the same
parametric form as the model assumption. More explicitly,
statisticians usually construct a parametric conditional
distribution $\mathcal{P}(Y \mid X, \theta)$, and assume
that the true conditional distribution is also of this form
(with unknown $\theta$). The justification of conditional
probabilities can then be achieved by using either
hypothesis testing or confidence interval estimation on
$\theta$.

However, in modern data analysis workflows, the realizable
assumption is often violated; e.g., data analysts usually
try out several off-the-shelf classification algorithms to
identify those that work best. This setting is often called
agnostic, essentially implying that we do not have any
knowledge about the underlying distribution. In the
agnostic setting, conditional probability estimates can no
longer be justified by standard statistical tools, as most
hypothesis testing methods are designed to distinguish two
parameter regions in the hypothesis space (e.g.,
$\theta < \theta_0$ vs. $\theta > \theta_0$), and confidence
intervals require the realizable assumption to be
interpretable.

In this paper, we study the interpretability of conditional
probabilities in binary classification in the agnostic
setting: what kind of guarantees can we have without making
any assumption on the underlying distribution? Justifying
these conditional probabilities is important for
applications that explicitly utilize the conditional
probability estimates of the labels, including medical
diagnostic systems (Cooper, 1984) and fraud detection
(Fawcett and Provost, 1997). In such applications, the
misclassification loss function is often asymmetric (i.e.,
false positives and false negatives incur different losses),
and accurate conditional probability estimates are crucial
empirically. In particular, in medical diagnostic systems,
a false positive means additional tests are needed, while a
false negative could potentially be fatal.

Summary of Notation 

We focus on the binary classification problem in this
paper. Let us first define some notation that will be used
throughout the paper:

• $\mathcal{X}$ denotes the discrete feature space and
$\mathcal{Y} = \{\pm 1\}$ denotes the label space.

• $\mathcal{P}$ denotes the underlying distribution over
$\mathcal{X} \times \mathcal{Y}$ that governs the generation
of datasets.

• $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ denotes a set of
i.i.d. data points from $\mathcal{P}$.

• A fuzzy classifier is a function from $\mathcal{X}$ to
$[0, 1]$ whose output denotes the estimated conditional
probability $\mathcal{P}(Y = 1 \mid X)$.

Interpretations of Conditional Probability Estimates

Ideally, we hope that our conditional probability estimates
can be interpreted as the true conditional probabilities.
This interpretation is justified if we can prove that the
conditional probability estimates are close to the true
values. Let $\ell_1(f, \mathcal{P})$ be the $\ell_1$
distance between the true distribution and the estimated
distribution as a measure of the “correctness” of
conditional probability estimates:

$$\ell_1(f, \mathcal{P}) = \mathbb{E}_{X \sim \mathcal{P}} \left| f(X) - \mathcal{P}(Y = 1 \mid X) \right|$$

Here $X$ is a random variable representing the feature
vector of a sample data point, $Y$ is the label of $X$, and
$f(X)$ is a fuzzy classifier that estimates
$\mathcal{P}(Y = 1 \mid X)$. If we can prove that
$\ell_1(f, \mathcal{P}) < \epsilon$ for some small
$\epsilon$, then the output of $f$ can be approximately
interpreted as the true conditional probability.
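To make the $\ell_1$ distance concrete, here is a minimal sketch that evaluates it on a finite feature space where the true conditional probabilities are known, as they would be in a simulation. The function name `l1_distance` and the toy distribution are illustrative assumptions, not part of the paper.

```python
def l1_distance(marginal, true_cond, f):
    """ell_1(f, P) = E_{X~P} |f(X) - P(Y=1|X)| on a finite feature space.

    marginal:  dict mapping x -> P(X = x)          (hypothetical toy input)
    true_cond: dict mapping x -> P(Y = 1 | X = x)  (known only in simulations)
    f:         callable mapping x -> estimate in [0, 1]
    """
    return sum(q * abs(f(x) - true_cond[x]) for x, q in marginal.items())


# Toy distribution: two equally likely feature values with deterministic labels.
marginal = {"a": 0.5, "b": 0.5}
true_cond = {"a": 0.0, "b": 1.0}

# A constant classifier that outputs the overall label frequency.
constant = lambda x: 0.5
print(l1_distance(marginal, true_cond, constant))  # 0.5
```

The constant classifier in this toy example matches the overall frequency of positive labels, yet its $\ell_1$ distance is large, which is the kind of gap the rest of the paper is concerned with.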

Unfortunately, as we will show in this paper, it is
impossible to guarantee any reasonably small upper bound
for $\ell_1(f, \mathcal{P})$ under the agnostic assumption.
In fact, as we will demonstrate, in the cases where we have
to make the agnostic assumption, the estimated conditional
probabilities are usually no longer close to the true
values in practice.

Therefore, instead of trying to bound the $\ell_1$ distance,
we develop an alternative interpretation for these
conditional probability estimates. We introduce the
following calibration definition for fuzzy classifiers:

Definition 1. Let $\mathcal{X}$ be the feature space,
$\mathcal{Y} = \{\pm 1\}$ be the label space and
$\mathcal{P}$ be the distribution over
$\mathcal{X} \times \mathcal{Y}$. Let
$f : \mathcal{X} \to [0, 1]$ be a fuzzy classifier. We say
$f$ is calibrated if for any $p_1 < p_2$, we have:

$$\mathbb{E}_{X \sim \mathcal{P}}\left[\mathbb{1}_{p_1 < f(X) < p_2}\, f(X)\right] = \mathcal{P}(p_1 < f(X) < p_2,\; Y = 1)$$

Intuitively, a fuzzy classifier is calibrated if its output
correctly reflects the relative frequency of labels among
instances it believes to be similar. For instance, suppose
the classifier outputs $f(X) = p$ for $n$ data points; then
roughly $np$ of those data points have label $Y = 1$. We
also define a measure of how close $f$ is to being
calibrated:


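Definition 2. A fuzzy classifier $f$ is $\epsilon$-calibrated if

$$c(f) = \sup_{p_1 < p_2} \left| \mathcal{P}(p_1 < f(X) < p_2,\; Y = 1) - \mathbb{E}_{X \sim \mathcal{P}}\left[\mathbb{1}_{p_1 < f(X) < p_2}\, f(X)\right] \right| < \epsilon$$

$f$ is $\epsilon$-empirically calibrated with respect to $D$ if

$$c_{emp}(f, D) = \frac{1}{n} \sup_{p_1 < p_2} \left| \sum_{i=1}^{n} \mathbb{1}_{p_1 < f(X_i) < p_2,\; Y_i = 1} - \sum_{i=1}^{n} \mathbb{1}_{p_1 < f(X_i) < p_2}\, f(X_i) \right| < \epsilon$$

where $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is a size $n$
dataset consisting of i.i.d. samples from $\mathcal{P}$.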

Note that the empirical calibration measure $c_{emp}(f, D)$
can be efficiently computed on a finite dataset. We further
prove that under certain conditions, $c_{emp}(f, D)$
converges uniformly to $c(f)$ over all functions $f$ in a
hypothesis class. Therefore, the calibration property of
these classifiers can be demonstrated by showing that they
are empirically calibrated on the training data.
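One way to see why the empirical measure is cheap to compute: after sorting the data by score, every interval $(p_1, p_2)$ selects a contiguous run of points, so the supremum reduces to the largest absolute sum of residuals $\mathbb{1}_{Y_i = 1} - f(X_i)$ over contiguous runs. The sketch below follows that observation; the function name and the tie-handling via `groupby` are our own illustrative choices, not the authors' implementation.

```python
from itertools import groupby


def empirical_calibration(scores, labels):
    """c_emp(f, D) for scores f(X_i) in [0, 1] and labels Y_i in {+1, -1}.

    Points are sorted by score; points sharing a score are pooled because
    no interval can separate them. The largest absolute sum over a
    contiguous run equals (max prefix sum) - (min prefix sum).
    """
    n = len(scores)
    pairs = sorted(zip(scores, labels))
    prefix = 0.0
    lo = hi = 0.0  # running min / max of prefix sums (empty prefix is 0)
    for _, group in groupby(pairs, key=lambda p: p[0]):
        prefix += sum((1.0 if y == 1 else 0.0) - s for s, y in group)
        lo, hi = min(lo, prefix), max(hi, prefix)
    return (hi - lo) / n


# Half of the points scored 0.5 are positive: empirically calibrated.
print(empirical_calibration([0.5, 0.5, 0.5, 0.5], [1, -1, 1, -1]))  # 0.0
```

The cost is dominated by the sort, so the whole computation takes $O(n \log n)$ time.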

The calibration definition is motivated by analyzing the
properties of commonly used conditional probability
estimation algorithms: many such algorithms will generate
classifiers that are naturally calibrated. Our calibration
property justifies the common practice of using calibrated
conditional probability estimates as true conditional
probabilities: we show that if the fuzzy classifier is
calibrated and the output of the classifier is the only
source of information, then the optimal strategy is to
apply the Bayes Decision Rule to the conditional probability
estimates.

The uniform convergence result between $c_{emp}(f, D)$ and
$c(f)$ has several applications. First, it can be directly
used to prove that a fuzzy classifier is (almost)
calibrated, which is necessary for the conditional
probability estimates to be interpretable. Second, it
suggests that we need to minimize the empirical calibration
measure to obtain calibrated classifiers, which is a new
direction for designing conditional probability estimation
algorithms. Finally, given uncalibrated conditional
probability estimates as input, we can calibrate them by
minimizing the calibration measure. In fact, one of the
most well-known calibration algorithms, the isotonic
regression algorithm, can be interpreted this way.

Paper Outline 

The rest of this paper is organized as follows. In Section
2, we argue that the $\ell_1$ distance cannot be provably
bounded under the agnostic assumption (Theorem 1), and then
motivate our calibration definition. In Section 3, we
present the uniform convergence result (Theorem 2) and
discuss the potential applications. In Section 4, we
conduct experiments to illustrate the behavior of our
calibration measure on several common classification
algorithms.

Related Work 

Our definition of calibration is similar to the definition
of calibration in prediction theory (Foster and Vohra,
1998), where the goal is also to make predicted probability
values match the relative frequency of correct predictions.
In prediction theory, the problem is formulated from a
game-theoretic point of view: the sequence generator is
assumed to be malevolent, and the goal is to design
algorithms that achieve this calibration guarantee no
matter what strategy the sequence generator uses.

To the best of our knowledge, there is no other work
addressing the interpretability of conditional probability
estimates in agnostic cases. Our definition of calibration
is also connected to the problem of calibrating conditional
probability estimates, which has been studied in many
papers (Zadrozny and Elkan, 2002; Platt, 1999).

2 The Calibration Definition:
Motivation & Impossibility Result

2.1 Impossibility result for $\ell_1$ distance

Recall that the $\ell_1$ distance between $f$ and
$\mathcal{P}$ is defined as:

$$\ell_1(f, \mathcal{P}) = \mathbb{E}_{X \sim \mathcal{P}} \left| f(X) - \mathcal{P}(Y = 1 \mid X) \right|$$

Suppose $f$ is the conditional probability estimator that we
learned from the training dataset. We attempt to prove that
the $\ell_1$ distance between $f$ and $\mathcal{P}$ is small
in the agnostic setting. In this setting, we do not know
anything about $\mathcal{P}$, and the only tool we can
utilize is a validation dataset $D_{val}$ that consists of
i.i.d. samples from $\mathcal{P}$. Therefore, our best hope
would be a prover $A_f(D)$ that:

• Returns 1 with high probability if $\ell_1(f, \mathcal{P})$ is small.

• Returns 0 with high probability if $\ell_1(f, \mathcal{P})$ is large.

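The following theorem states that no such prover exists;
the proof can be found in the appendix.

Theorem 1. Let $Q$ be a probability distribution over
$\mathcal{X}$, and $f : \mathcal{X} \to [0, 1]$ be a fuzzy
classifier. Define $B_f$ as:

$$B_f = \mathbb{E}_{X \sim Q} \min(f(X),\; 1 - f(X))$$

If we have that $\forall x \in \mathcal{X}$,
$Q(x) < \frac{1}{10 n^2}$, then there is no prover
$A_f : \{\mathcal{X} \times \mathcal{Y}\}^n \to \{0, 1\}$
for $f$ satisfying the following two conditions:

For any $\mathcal{P}$ over $\mathcal{X} \times \mathcal{Y}$
such that $\mathcal{P}_X = Q$ (i.e., $\forall x \in
\mathcal{X}$, $\mathcal{P}_X(x) = Q(x)$), suppose
$D_{val} \in \{\mathcal{X} \times \mathcal{Y}\}^n$ is a
validation dataset consisting of $n$ i.i.d. samples from
$\mathcal{P}$: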

1. IfhifiV) = 0, then Po^MiD^ai) = 1) > §• 
IfhifiV) > then Po^^MfiDvai) = 1) < |- 

We made the assumption in Theorem 1 to exclude the scenario
where a significant amount of probability mass concentrates
on a few data points, so that their corresponding
conditional probabilities can be estimated via repeated
sampling. Note that the statement is not true in the
extreme case where all probability mass concentrates on one
single data point (i.e., $\exists x \in \mathcal{X}$,
$Q(x) = 1$). The assumption holds when the feature space
$\mathcal{X}$ is large enough that it is almost impossible
for any data point to have enough probability mass to be
sampled more than once in the training dataset.
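The role of this assumption can be illustrated with a short calculation. The sketch below assumes the bound on the point masses is $Q(x) < 1/(10n^2)$, which is the form suggested by the collision bound in the appendix; under that reading, a sample of size $n$ contains a repeated feature value with probability below 0.1.

```latex
% Collision probability for one pair of samples (i \ne j):
\Pr(X_i = X_j) = \sum_{x \in \mathcal{X}} Q(x)^2
  \le \Big( \max_{x \in \mathcal{X}} Q(x) \Big) \sum_{x \in \mathcal{X}} Q(x)
  < \frac{1}{10 n^2}

% Union bound over fewer than n^2 pairs:
\Pr(\exists\, i \ne j : X_i = X_j) < n^2 \cdot \frac{1}{10 n^2} = 0.1
```

With no repeated feature values, a validation set never observes two labels for the same $x$, so it carries almost no direct information about $\mathcal{P}(Y = 1 \mid X = x)$ at any individual point.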

The significance of Theorem 1 is that any attempt to
guarantee a small upper bound on $\ell_1(f, \mathcal{P})$
is bound to fail. Thus, we can no longer interpret the
conditional probability estimates as the true conditional
probabilities in the agnostic setting. This result
motivates us to develop a new measure of “correctness” to
justify the conditional probability estimates.

2.2 $\ell_1(f, \mathcal{P})$ in practice

The fact that we cannot guarantee an upper bound on the
$\ell_1$ distance is not merely a theoretical artifact. In
fact, in the cases where we need to make the agnostic
assumption, the value of $\ell_1(f, \mathcal{P})$ is often
very large in practice. Here we use the following document
categorization example to demonstrate this point.

Example 1. Let $Z$ be the collection of all English words.
In this problem the feature space $\mathcal{X} = Z^*$ is
the collection of all possible word sequences, and $Y$
denotes whether the document belongs to a certain topic
(say, football). Let $\mathcal{P}$ be the following data
generation process: $X$ is generated from the Latent
Dirichlet Allocation model (Blei et al., 2003), and $Y$ is
chosen randomly according to the topic mixture.

We use logistic regression, which is parameterized by a
weight function $w : Z \to \mathbb{R}$ and two additional
parameters $a$ and $b$. For each document
$X = z_1 z_2 \ldots z_k$, the output of the classifier is:

$$f(X) = \frac{1}{1 + \exp\left(-a \sum_{i=1}^{k} w(z_i) - b\right)}$$
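The classifier output above can be sketched in a few lines. The dictionary representation of the weight function and the choice to give unseen words weight 0 are assumptions made for illustration only.

```python
import math


def logistic_output(words, w, a, b):
    """f(X) = 1 / (1 + exp(-a * sum_i w(z_i) - b)) for a document X = z_1 ... z_k.

    words: list of tokens z_1, ..., z_k
    w:     dict mapping a word to its real-valued weight
           (words missing from the dict get weight 0; an assumption)
    a, b:  the two additional scalar parameters
    """
    total = sum(w.get(z, 0.0) for z in words)
    return 1.0 / (1.0 + math.exp(-a * total - b))


# Hypothetical weights: documents mentioning "goal" lean toward the topic.
weights = {"goal": 1.0, "recipe": -1.0}
print(logistic_output(["goal", "goal"], weights, a=1.0, b=0.0))
```

Note that the output depends on the document only through the summed weight, which is what lets Section 2.3 treat each document as a single number.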

The reason that we are using automatically generated
documents instead of real documents here is that the
conditional probabilities $\mathcal{P}(Y \mid X)$ are
directly computable (otherwise we cannot evaluate
$\ell_1(f, \mathcal{P})$ and other measures). We conducted
an experimental simulation for this example, and the
experimental details can be found in the appendix. Here we
summarize the major findings: the logistic regression
classifier has very large $\ell_1$ error, which is probably
due to the discrepancy between the logistic regression
model and the underlying model. However, the logistic
regression classifier is almost naturally calibrated in
this example. This is not a coincidence, and we will
discuss the corresponding intuition in Section 2.3.


2.3 The Motivation of the Calibration 
Measure 

Let us revisit Example 1. This time, we fix the word weight
function $w$. In this case, every document $X$ can be
represented using a single parameter
$w(X) = \sum_{i} w(z_i)$, and we search for the optimal $a$
and $b$ such that the log-likelihood is maximized. This is
illustrated in Figure 1.


Figure 1: Illustration of Example 1

Now, intuitively, to maximize the log-likelihood, we need
the sigmoid function $1 / (1 + \exp(-a\, w(X) - b))$ to
match the conditional probability of $Y$ conditioned on
$w(X)$: $\mathcal{P}(Y = 1 \mid w(X))$. Therefore, for the
optimal $a$ and $b$, we could say that the following
property is roughly correct:

$$\mathcal{P}(Y = 1 \mid w(X)) \approx \frac{1}{1 + \exp(-a\, w(X) - b)}$$

In other words,

$$\forall\, 0 < p < 1, \quad \mathbb{E}\left[\mathcal{P}(Y = 1 \mid X) \mid f(X) = p\right] \approx p$$

Let us examine this example more closely. The reason why
the logistic regression classifier tells us that
$f(X) \approx p$ is the following: among all the documents
with similar weight $w(X)$, about a $p$ portion of them
actually belong to the topic in the training dataset. This
leads to an important observation: logistic regression
estimates the conditional probabilities by computing the
relative frequency of labels among documents it believes to
be similar.

This behavior is not unique to logistic regression. Many
other algorithms, including decision tree classifiers,
nearest neighbor (NN) classifiers, and neural networks,
exhibit similar behavior:

• In decision trees, all data points reaching the same
decision leaf are considered similar.

• In NN classifiers, all data points with the same
nearest neighbors are considered similar.

• In neural networks, all data points with the same
output layer node values are considered similar.

We can abstract the above conditional probability
estimators as the following two-step process:

1. Partition the feature space $\mathcal{X}$ into several regions.

2. Estimate the relative frequency of labels among
all data points inside each region.

The definition of the calibration property follows easily
from the above two-step process. We can argue that the
classifier is approximately calibrated if, for each region
$S$ in the feature space $\mathcal{X}$, the output
conditional probability of data points in $S$ is close to
the actual relative frequency of labels in $S$. The
definition of the calibration property then follows from
the fact that all data points inside each region have the
same output conditional probability:

$$\forall p_1 < p_2, \quad \mathcal{P}(p_1 < f(X) < p_2,\; Y = 1) = \mathbb{E}_{X \sim \mathcal{P}}\left[\mathbb{1}_{p_1 < f(X) < p_2}\, f(X)\right]$$
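The two-step process described above (partition the feature space, then estimate label frequencies per region) can be sketched as follows. The function name and the `region_of` callback are hypothetical; a decision tree's leaf index or a score bucket would play that role in practice.

```python
from collections import defaultdict


def fit_region_frequencies(xs, ys, region_of):
    """Two-step estimator: partition, then per-region frequency of Y = 1.

    xs:        feature values
    ys:        labels in {+1, -1}
    region_of: callable mapping a feature value to a hashable region id
               (step 1; hypothetical stand-in for a tree leaf, bucket, etc.)
    Returns a dict region id -> relative frequency of positive labels (step 2).
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for x, y in zip(xs, ys):
        region = region_of(x)
        totals[region] += 1
        positives[region] += (y == 1)
    return {region: positives[region] / totals[region] for region in totals}


# Toy one-dimensional example with two regions split at 0.5.
xs = [0.1, 0.2, 0.8, 0.9]
ys = [1, -1, 1, 1]
print(fit_region_frequencies(xs, ys, lambda x: x >= 0.5))
# {False: 0.5, True: 1.0}
```

By construction, the output assigned to each region equals the training frequency of positive labels in that region, which is why estimators of this shape tend to be empirically calibrated on their training data.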

Using Calibrated Conditional Probabilities in
Decision Making

The calibration property justifies the common practice of
using estimated conditional probabilities in decision
making. Consider the binary classification problem with
asymmetric misclassification loss: we lose $a$ points for
every false positive and $b$ points for every false
negative. In this case, the best decision strategy is to
predict 1 if $\mathcal{P}(Y = 1 \mid X) > \frac{a}{a+b}$
and predict $-1$ otherwise. Now consider the case when we
do not know $\mathcal{P}(Y = 1 \mid X)$, but only know the
value of $f(X)$ instead. If we can only use $f(X)$ to make
decisions, and $f$ is calibrated, then the best strategy is
to use $f(X)$ in the same way as
$\mathcal{P}(Y = 1 \mid X)$ (the proof can be found in the
appendix):

Claim 1. Suppose we are given a calibrated fuzzy classifier
$f : \mathcal{X} \to [0, 1]$, and we need to make decisions
solely based on the output of $f$. Denote our decision as
$g : [0, 1] \to \{\pm 1\}$ (i.e., our decision for $X$ is
$g(f(X))$). Then the optimal strategy $g^*$ to minimize the
expected loss is the following:

$$g^*(x) = \begin{cases} 1 & x > \frac{a}{a+b} \\ -1 & x \le \frac{a}{a+b} \end{cases}$$

3 Uniform Convergence of the
Calibration Measure


3.1 The Uniform Convergence Result 

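Let $\mathcal{G}$ be a collection of functions from
$\mathcal{X} \times \mathcal{Y}$ to $[0, 1]$. The
Rademacher Complexity (Bartlett and Mendelson, 2003) of
$\mathcal{G}$ with respect to $D$ is defined as
(Shalev-Shwartz and Ben-David, 2014)¹:

$$R_D(\mathcal{G}) = \frac{1}{n}\, \mathbb{E}_{\sigma}\left[\sup_{g \in \mathcal{G}} \sum_{i=1}^{n} \sigma_i\, g(x_i, y_i)\right]$$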

Then we have the following result: 

Theorem 2. Let $\mathcal{F}$ be a set of fuzzy classifiers,
i.e., functions from $\mathcal{X}$ to $[0, 1]$. Let
$\mathcal{H}$ be the set of binary classifiers obtained by
thresholding the output of fuzzy classifiers in
$\mathcal{F}$:

$$\mathcal{H} = \left\{ \mathbb{1}_{p_1 < f(x) < p_2} : p_1, p_2 \in \mathbb{R},\; f \in \mathcal{F} \right\}$$

Suppose the Rademacher Complexity of $\mathcal{H}$ satisfies:

EoRoin) + ^ 

\ n 2 

Then,

$$\Pr_D\left(\sup_{f \in \mathcal{F}} \left| c(f) - c_{emp}(f, D) \right| > \epsilon\right) < \delta$$

The proof of this theorem, together with a discussion of
the hypothesis class $\mathcal{H}$, can be found in the
appendix.


3.2 Applications of Theorem 2

3.2.1 Verifying the calibration of a classifier

The first application of Theorem 2 is that we can verify
whether the learned classifier $f$ is calibrated. For
simple hypothesis spaces $\mathcal{F}$ (e.g., logistic
regression), the corresponding hypothesis space
$\mathcal{H}$ has low Rademacher Complexity. In this case,
Theorem 2 naturally guarantees the generalization of the
calibration measure.

There are also cases where the Rademacher Complexity of
$\mathcal{H}$ is not small. One notable example is SVM
classifiers with Platt Scaling (Platt, 1999):

¹Our definition of Rademacher Complexity comes from
Shalev-Shwartz and Ben-David’s textbook (2014), which
is slightly different from the original definition in Bartlett
and Mendelson’s paper (2003).


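Claim 2. Let $\mathcal{X} \subseteq \mathbb{R}^d$ and
$\forall x \in \mathcal{X}$, $\|x\|_2 < 1$. Let
$\mathcal{F}$ be the following hypothesis class:

$$\mathcal{F} = \left\{ x \mapsto \frac{1}{1 + \exp(a\, w^T x + b)} : w \in \mathbb{R}^d,\; \|w\|_2 < B,\; a, b \in \mathbb{R} \right\}$$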

If the training data size n < d and the training data 
Xi are linearly independent, then Rd{TL) = 

The proof can be found in the appendix. In the case of SVM,
the dimensionality of the feature space is usually much
larger than the training dataset size (this is especially
true for kernel SVM). In this situation, we can no longer
verify the calibration property using only the training
data, and we have to keep a separate validation dataset to
calibrate the classifier (as suggested by Platt (1999)).
When verifying the calibration of a classifier on a
validation dataset, the hypothesis class is
$\mathcal{F} = \{f\}$, and it is easy to verify that
$\mathbb{E}_D R_D(\mathcal{H})$ is $O(\sqrt{\log n / n})$ in
this case. Therefore, with enough validation data, we can
still bound the calibration measure.

3.2.2 Implications on Learning Algorithm 
Design 

Standard conditional probability estimation usually
maximizes the likelihood to find the best classifier within
the hypothesis space. However, since we can only guarantee
the conditional probability estimates to be calibrated
under the agnostic assumption, any calibrated classifier is
essentially as good as the maximum likelihood estimate in
terms of interpretability. Therefore, likelihood
maximization is not necessarily the only method for
estimating conditional probabilities.

There are other loss functions that are already widely used
for binary classification. For example, hinge loss is at
the foundation of large margin classifiers. Based on our
discussion in this paper, we believe that these loss
functions can also be used for conditional probability
estimation. For example, Theorem 2 suggests the following
constrained optimization problem:

$$\min_{f}\; \mathcal{L}(f, D) \quad \text{s.t.} \quad c_{emp}(f, D) = 0$$

where $\mathcal{L}(f, D)$ is the loss function we want to
minimize. By optimizing over the space of empirically
calibrated classifiers, we can ensure that the resulting
classifier is also calibrated with respect to $\mathcal{P}$.

In fact, the conditional probability estimation algorithm
developed by Kakade et al. (2011) already implicitly
follows this framework (more elaboration on this point can
be found in the appendix). We believe that many more
interesting algorithms can be developed along this
direction.

3.2.3 Connection to the Calibration Problem

Suppose that we are given an uncalibrated fuzzy classifier
$f_0 : \mathcal{X} \to [0, 1]$, and we want to find a
function $g$ from $[0, 1]$ to $[0, 1]$, so that
$g \circ f_0$ presents a better conditional probability
estimate. This is the problem of classifier calibration,
which has been studied in many papers (Zadrozny and Elkan,
2002; Platt, 1999).

Traditionally, calibration algorithms find the best link
function $g$ by maximizing likelihood or minimizing squared
loss. In this paper, we suggest a different approach to the
calibration problem. We can find the best $g$ by minimizing
the empirical calibration measure $c_{emp}(g \circ f_0, D)$.
Let us assume w.l.o.g. that the training dataset
$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ satisfies

$$g(f_0(x_1)) \le \ldots \le g(f_0(x_n))$$

Then we have,

$$c_{emp}(g \circ f_0, D) = \frac{1}{n} \sup_{p_1, p_2} \left| \sum_{i=1}^{n} \mathbb{1}_{p_1 < g(f_0(x_i)) < p_2} \left(\mathbb{1}_{y_i = 1} - g(f_0(x_i))\right) \right| \le \frac{1}{n} \max_{a, b} \left| \sum_{a \le i \le b} \left(\mathbb{1}_{y_i = 1} - g(f_0(x_i))\right) \right|$$

This expression can be used as the objective function for
calibration: we search over the hypothesis space
$\mathcal{G}$ to find a function $g$ that minimizes this
objective function. Compared to other loss functions, the
benefit of minimizing this objective function is that the
resulting classifier is more likely to be calibrated, and
therefore provides more interpretable conditional
probability estimates.

In fact, one of the most well-known calibration algorithms,
the isotonic regression algorithm, can be viewed as
minimizing this objective function:

Claim 3. Let $\mathcal{G}$ be the set of all continuous
nondecreasing functions from $[0, 1]$ to $[0, 1]$. Then the
optimal solution that minimizes the squared loss
$\min_{g \in \mathcal{G}} \sum_{i=1}^{n} (\mathbb{1}_{y_i = 1} - g(f_0(x_i)))^2$
also minimizes
$\min_{g \in \mathcal{G}} \max_{a, b} \left| \sum_{a \le i \le b} (\mathbb{1}_{y_i = 1} - g(f_0(x_i))) \right|$.

The proof can be found in the appendix. Using this
connection we proved several interesting properties of the
isotonic regression algorithm, which can also be found in
the appendix.
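For readers unfamiliar with isotonic regression, the squared-loss problem in Claim 3 is commonly solved with the pool-adjacent-violators (PAV) procedure. The sketch below is a textbook PAV implementation rather than code from the paper; it assumes the targets $\mathbb{1}_{y_i = 1}$ are already ordered by the uncalibrated score $f_0(x_i)$ and that the scores are distinct. It returns fitted values at the training points only, which can then be interpolated to obtain a continuous nondecreasing $g$.

```python
def isotonic_fit(targets):
    """Pool-adjacent-violators: nondecreasing least-squares fit.

    targets: values 1[y_i = 1], ordered by increasing f0(x_i)
             (ordering and distinct scores are assumptions of this sketch)
    Returns the fitted values g(f0(x_i)) in the same order.
    """
    blocks = []  # each block is [sum of targets, number of points]
    for t in targets:
        blocks.append([float(t), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        # Means are compared by cross-multiplication to avoid division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


print(isotonic_fit([1, 0, 1, 1]))  # [0.5, 0.5, 1.0, 1.0]
```

Each pooled block is assigned the mean of its targets, so the residuals within every block sum to zero; this is the property that connects the squared-loss solution to the calibration objective.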

4 Empirical behavior of the 
calibration measure 

In this section, we conduct some preliminary experiments to
demonstrate the behavior of the calibration measure on some
common algorithms. We use two binary classification
datasets from the UCI Repository²: ADULT³ and COVTYPE⁴.
COVTYPE has been converted to a binary classification
problem by treating the largest class as positive and the
rest as negative. Five algorithms have been used in these
experiments: naive Bayes (NB), boosted decision trees,
SVM⁵, logistic regression (LR), and random forest (RF).



Figure 2: The empirical calibration error (methods shown: NB, Boosted Tree, SVM, LR, RF)

Figure 2 shows the empirical calibration error $c_{emp}$ on
test datasets for all methods. From the experimental
results, it appears that logistic regression and random
forest naturally produce calibrated classifiers, which is
intuitive given the discussion in this paper. The
calibration measure of naive Bayes seems to depend on the
dataset. For large margin methods (SVM and boosted trees),
the calibration measures are high, meaning that they are
not calibrated (on these two datasets).

There is also an interesting connection between the
calibration error and the benefit of applying a calibration
algorithm, which is illustrated in Figure 3. In this
experiment, we used a loss parameter $p$ to control the
asymmetric loss: each false negative incurs $1 - p$ cost
and each false positive incurs $p$ cost. All the algorithms
are first trained on the training dataset, then calibrated
on a separate validation set of size 2000 using isotonic
regression. For each algorithm, we compute the
pre-calibration and post-calibration average losses on the
testing dataset using the following decision rule: for each
data point $X$, we predict $Y = 1$ if and only if we
predict that $\Pr(Y = 1 \mid X) > p$. Finally, we report the
ratio between the two losses:

$$\text{loss ratio} = \frac{\text{the average loss after calibration}}{\text{the average loss before calibration}}$$
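The evaluation protocol described above can be sketched as follows. The function names are illustrative, and the sketch assumes the pre-calibration average loss is nonzero so that the ratio is defined.

```python
def average_loss(scores, labels, p):
    """Average asymmetric loss under the rule: predict +1 iff score > p.

    Each false positive costs p and each false negative costs 1 - p,
    matching the loss parameter described in the text.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        prediction = 1 if score > p else -1
        if prediction == 1 and label == -1:
            total += p          # false positive
        elif prediction == -1 and label == 1:
            total += 1.0 - p    # false negative
    return total / len(labels)


def loss_ratio(scores_before, scores_after, labels, p):
    """Post-calibration average loss divided by pre-calibration average loss."""
    return average_loss(scores_after, labels, p) / average_loss(scores_before, labels, p)


# Toy example: calibration fixes one of the two mistakes.
labels = [1, -1]
before = [0.4, 0.6]
after = [0.6, 0.6]
print(loss_ratio(before, after, labels, p=0.5))  # 0.5
```

A ratio below 1 means calibration reduced the average loss; a ratio near 1 means the classifier gained little from it.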

²These datasets are chosen from the datasets used in
Niculescu-Mizil and Caruana’s work (2005). We only used
two datasets because the experiments are only exploratory
(i.e., identifying potential properties of the calibration
measure). More rigorous experiments are needed to
formally verify these properties.

³https://archive.ics.uci.edu/ml/datasets/Adult
⁴https://archive.ics.uci.edu/ml/datasets/Covertype
⁵For SVM and boosting, we rescale the output score to
[0,1] by (x − min)/(max − min) as in Niculescu-Mizil and
Caruana’s paper (2005).










Yihan Gao, Aditya Parameswaran, Jian Peng 


Figure 3: The loss ratio on two datasets (panel (a): Adult; x-axis: loss parameter)

As we can see in Figure 3, the calibration procedure on
average reduces the cost by 3%-5% for naive Bayes and
random forest, 20% for SVM, 12% for boosted trees, and
close to 0% for logistic regression. Comparing with the
results in Figure 2, the two algorithms that benefit most
from calibration (i.e., SVM and boosted trees) also have
high empirical calibration error. This result suggests that
if an algorithm already has a low calibration error to
begin with, then it is not likely to benefit much from the
calibration process. This finding could potentially help us
decide whether we need to calibrate the current classifier
using isotonic regression (Niculescu-Mizil and Caruana,
2005).

5 Conclusion 

In this paper, we discussed the interpretability of
conditional probability estimates under the agnostic
assumption. We proved that it is impossible to upper bound
the $\ell_1$ error of conditional probability estimates in
this setting. Instead, we defined a novel measure of
calibration to provide interpretability for conditional
probability estimates. The uniform convergence result
between the measure and its empirical counterpart allows us
to empirically verify the calibration property without
making any assumption on the underlying distribution: the
classifier is (almost) calibrated if and only if the
empirical calibration measure is low. Our result provides
new insights on conditional probability estimation:
ensuring empirical calibration is already sufficient for
providing interpretable conditional probability estimates,
and thus many other loss functions (e.g., hinge loss) can
also be utilized for estimating conditional probabilities.

Appendix

Experimental Simulation of Example 1

Here we experimentally simulate Example 1 to illustrate
that the logistic regression classifier has large $\ell_1$
error. We use Latent Dirichlet Allocation (LDA) (Blei et
al., 2003), the state-of-the-art generative model for
documents, to generate datasets. The detailed experiment
settings are listed below:

• The dataset consists of 20000 documents, the
number of topics is 20, the dictionary size is 1000,
and the average number of words in each document is 200.

• We use the non-informative Dirichlet prior
$\alpha = (1, 1, \ldots, 1)$ over topics. The word
distribution in each topic follows a power law with a
random order among words.


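1. With probability at least $1 - \delta_1$, a dataset $D$
with $n$ i.i.d. samples from $\mathcal{P}$ will pass $V$:

$$\Pr_D(V(D) = 1) \ge 1 - \delta_1$$

2. With probability at least $1 - \delta_2$, a dataset $D$
with $n$ i.i.d. samples from $\mathcal{P}$ has distinct
feature values:

$$\Pr(\forall i \ne j,\; X_i \ne X_j) \ge 1 - \delta_2$$

Then there exists another probability distribution
$\mathcal{P}'$ such that:

1. With probability at least $1 - \delta_1 - \delta_2$, a
dataset $D'$ with $n$ i.i.d. samples from $\mathcal{P}'$
will also pass $V$:

$$\Pr_{D'}(V(D') = 1) \ge 1 - \delta_1 - \delta_2$$

2. $$\forall X \in \mathcal{X}, \quad \sum_{Y \in \mathcal{Y}} \mathcal{P}(X, Y) = \sum_{Y \in \mathcal{Y}} \mathcal{P}'(X, Y)$$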


• For each document, we randomly sample with
replacement 10 topic labels from the topic distribution.


Table 1 reports the mean experiment results and the
standard deviation across five runs. For reference we also
include the relative frequency of labels, and the $\ell_1$
error achieved by the trivial classifier that always
outputs the global relative frequency of labels as the
conditional probability.

Average $\ell_1$ Error      Empirical Calibration
0.1270 ± 0.0008             0.0083 ± 0.0003

Trivial $\ell_1$ Error      Frequency of Labels
0.2022 ± 0.0001             0.3448 ± 0.0001

Table 1: $\ell_1$ error and empirical calibration

As we can see from Table 1, logistic regression only
achieves an average $\ell_1$ error of 0.13, while even the
trivial classifier can achieve 0.2. This implies that
logistic regression performed very badly in this example.
However, as we can also see from Table 1, the empirical
calibration measure of the logistic regression classifier
is relatively low (0.01), indicating that the classifier is
almost calibrated.

Proof of Theorem 1

Proof. The proof relies on the following lemma:

3. $$\forall X \in \mathcal{X}, \quad \mathcal{P}'(Y = 1 \mid X) = 0 \text{ or } 1$$


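Proof. First we construct the following distribution over
all possible $\mathcal{P}'$ satisfying the last two
conditions:

$$\Pr(\mathcal{P}') = \prod_{X \in \mathcal{X}} Q\left(\mathcal{P}'(Y = 1 \mid X),\; \mathcal{P}(Y = 1 \mid X)\right)$$

where $Q(p', p)$ is defined as:

$$Q(p', p) = \begin{cases} p & p' = 1 \\ 1 - p & p' = 0 \end{cases}$$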

Now it is sufficient to show that if we sample
$\mathcal{P}'$ according to the above distribution and then
sample $D'$ from $\mathcal{P}'$, then with probability at
least $1 - \delta_1 - \delta_2$, $D'$ will pass $V$.
Assuming this is true, at least one distribution
$\mathcal{P}'$ has to satisfy the first condition, which
proves the existence of $\mathcal{P}'$.

To compute the probability that $D'$ would pass $V$, denote
$D_X = \{X_1, X_2, \ldots, X_n\}$ and
$D_Y = \{Y_1, Y_2, \ldots, Y_n\}$. Note that all
$\mathcal{P}'$ have the same marginal distribution over
$\mathcal{X}$, therefore:

Pr-pcD'iViD') = 1) =^Pr(P')5IP^(^'l^')r(D') 

-P' D' 

= ^ Pr{D'x) E Pr(D;.|P', D'x)ViD') 

D'x ■P' ^'y 


Lemma 1. Let $\mathcal{P}$ be a distribution over
$\mathcal{X} \times \mathcal{Y}$. Let $D$ be a size $n$
i.i.d. sample set from $\mathcal{P}$. Let $V$ be a verifier
of $\mathcal{P}$ given $D$ (i.e., $V$ is a function from
$\{\mathcal{X} \times \mathcal{Y}\}^n$ to $\{0, 1\}$), such
that


We only consider those $D'_X$ with distinct $X_i$ values.
Based on the assumption, such $D'_X$ account for at least
$1 - \delta_2$ of the probability mass. Now the important
observation is that for every fixed $D'_X$ with distinct
$X$ values, the marginal distribution of $D'_Y$ given
$D'_X$ (i.e., marginalizing over $\mathcal{P}'$) is exactly
$\mathcal{P}(D'_Y \mid D'_X)$, the distribution in which we
sample labels independently from $\mathcal{P}(Y \mid X)$
for each $X_i$ in $D'_X$.

Prp'v) E E D'^)V{D') 

D'x ■P' ^'y 

> E E ^r{D'y\V, D'^)V{D') 

D'x D'^ 

The latter probability is actually the probability that 
D' will pass V and have distinct X values at the same 
time. Based on the assumptions in the lemma, it oc¬ 
curs with probability at least 1 — i5i — <52. □ 

Now given this lemma, the proof of Theorem 1 is easy:
we show that if any prover $A_f$ satisfies the two conditions
in the theorem, it can be used as the verifier
$V$ in the lemma such that no $\mathcal{P}'$ can satisfy all three
conditions.

Let (5i = |, then the first assumption in the lemma is 
satished, also since Vx € T, Q(x) < we have: 


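Define $S = \{f(X) : X \in \mathcal{X}\}$, then we have:

$$\begin{aligned}
& a\,\mathcal{P}(g(f(X)) = 1, Y = -1) + b\,\mathcal{P}(g(f(X)) = -1, Y = 1) \\
&= \sum_{p \in S} \sum_{X : f(X) = p} \big[a\,\mathbb{1}_{g(p) = 1}\,\mathcal{P}(Y = -1, X) + b\,\mathbb{1}_{g(p) = -1}\,\mathcal{P}(Y = 1, X)\big] \\
&= \sum_{p \in S} \Big[a\,\mathbb{1}_{g(p) = 1} \sum_{X : f(X) = p} \mathcal{P}(Y = -1, X) + b\,\mathbb{1}_{g(p) = -1} \sum_{X : f(X) = p} \mathcal{P}(Y = 1, X)\Big]
\end{aligned}$$

Therefore, the optimal $g^*$ has $g^*(p) = 1$ if and only if:

$$a \sum_{X : f(X) = p} \mathcal{P}(Y = -1, X) < b \sum_{X : f(X) = p} \mathcal{P}(Y = 1, X)$$

which is equivalent to:

$$a\,\mathcal{P}(Y = -1 \mid f(X) = p) < b\,\mathcal{P}(Y = 1 \mid f(X) = p)$$

Since $f$ is calibrated, $\mathcal{P}(Y = 1 \mid f(X) = p) = p$, so the condition reads $a(1 - p) < bp$. Therefore $g^*(p) = 1$ if and only if $p > \frac{a}{a + b}$. □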
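The decision rule derived above can be illustrated with a small sketch. It assumes the score `p` is calibrated, that `a` is the cost of predicting 1 when $Y = -1$ and `b` the cost of predicting $-1$ when $Y = 1$ (as in the expected loss above); the function name `optimal_decision` is hypothetical and chosen only for this illustration.

```python
def optimal_decision(p, a, b):
    """Pick the label with the smaller expected cost for a calibrated score p.

    Assumption for this sketch: P(Y = 1 | f(X) = p) = p, so predicting +1
    costs a * (1 - p) in expectation and predicting -1 costs b * p.
    """
    cost_predict_pos = a * (1.0 - p)  # wrong exactly when Y = -1
    cost_predict_neg = b * p          # wrong exactly when Y = +1
    return 1 if cost_predict_pos < cost_predict_neg else -1


# With a = 1 and b = 3 the threshold a / (a + b) equals 0.25.
print(optimal_decision(0.30, 1.0, 3.0))
print(optimal_decision(0.20, 1.0, 3.0))
```

Comparing the two expected costs directly is equivalent to thresholding `p` at `a / (a + b)`, which is the form stated in the claim.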

Proof of Theorem 2


$$\forall i \ne j,\quad \Pr(X_i = X_j) = \sum_{X} Q(X)^2 \le \frac{1}{10 n^2}$$


By a union bound, we have: 


Proof. We will use the following uniform convergence
result (Shalev-Shwartz and Ben-David, 2014):

Theorem 3. Let $D$ be i.i.d. samples of $(\mathcal{X} \times \mathcal{Y}, \mathcal{P})$,
then with probability at least $1 - \delta$,




suPgeg \^J2'^=i9ixi,yi) - 'E9{X,Y)\ 

< 2EdRd{G) + (1) 


Therefore we can set $\delta_2 = 0.1$. By the above lemma,
there exists another $\mathcal{P}'$ such that

and

$$\forall X \in \mathcal{X},\ Y \in \mathcal{Y}:\quad \mathcal{P}'(X, Y) = 0 \text{ or } 1$$

On the other hand, note that the $\ell_1$ distance between
$\mathcal{P}'$ and $\mathcal{P}$ is at least $B$, then by the properties of $A_f$,
$D'$ cannot pass $A_f$ with probability greater than
This contradicts our earlier result. Therefore no such
$A_f$ can exist.

□ 


Proof of Claim 1

Proof. The expected loss is

$$a\,\mathcal{P}(g(f(X)) = 1, Y = -1) + b\,\mathcal{P}(g(f(X)) = -1, Y = 1)$$


In the following we sometimes allow $\mathcal{G}$ to be a collection
of functions from $\mathcal{X}$ to $[0, 1]$ in the above results.
When used in this sense, we assume that the function
does not use the label $y$: $g(x, y) = g(x)$.

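Define $\mathcal{F}_{D,p_1,p_2}(f)$ to be the relative frequency of the event $\{p_1 < f(x) \le p_2,\ y = 1\}$:

$$\mathcal{F}_{D,p_1,p_2}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{p_1 < f(x_i) \le p_2,\ y_i = 1}$$

Define $\mathcal{F}_{\mathcal{P},p_1,p_2}(f)$ to be the probability of the same event:

$$\mathcal{F}_{\mathcal{P},p_1,p_2}(f) = \mathcal{P}(p_1 < f(X) \le p_2,\ Y = 1)$$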

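Define $\mathcal{E}_{D,p_1,p_2}(f)$ as the empirical expectation of $f(x)\,\mathbb{1}_{p_1 < f(x) \le p_2}$:

$$\mathcal{E}_{D,p_1,p_2}(f) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\,\mathbb{1}_{p_1 < f(x_i) \le p_2}$$

Define $\mathcal{E}_{\mathcal{P},p_1,p_2}(f)$ as the expectation of the same function:

$$\mathcal{E}_{\mathcal{P},p_1,p_2}(f) = \mathbb{E}\big[f(X)\,\mathbb{1}_{p_1 < f(X) \le p_2}\big]$$

When the context is clear, the subscripts $p_1$ and $p_2$ can be dropped. Using these notations, we can rewrite $c(f)$ and $c_{emp}(f, D)$ as follows:

$$c(f) = \sup_{p_1, p_2} |\mathcal{F}_{\mathcal{P}}(f) - \mathcal{E}_{\mathcal{P}}(f)|$$

$$c_{emp}(f, D) = \sup_{p_1, p_2} |\mathcal{F}_{D}(f) - \mathcal{E}_{D}(f)|$$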
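As an illustration of the empirical quantity, the following sketch computes $c_{emp}(f, D)$ from a list of scores $f(x_i)$ and labels. It assumes labels are coded as 0/1 and that the intervals are half-open, $p_1 < f(x_i) \le p_2$; the function name `empirical_calibration` is hypothetical. Since every such interval selects a contiguous run of the points sorted by score, the supremum reduces to the largest difference between two prefix sums of the residuals $\mathbb{1}_{y_i = 1} - f(x_i)$, taken at boundaries between distinct score values.

```python
def empirical_calibration(scores, labels):
    """Sketch of c_emp(f, D) for scores f(x_i) in [0, 1] and 0/1 labels."""
    n = len(scores)
    pairs = sorted(zip(scores, labels))
    prefix = 0.0
    boundaries = [0.0]  # prefix sum before the first point
    for idx, (score, label) in enumerate(pairs):
        prefix += (1.0 if label == 1 else 0.0) - score
        # Points with equal scores always fall in the same interval,
        # so a boundary is only recorded after the last point of a tie.
        if idx == n - 1 or pairs[idx + 1][0] != score:
            boundaries.append(prefix)
    return (max(boundaries) - min(boundaries)) / n


print(empirical_calibration([0.5, 0.5], [0, 1]))
print(empirical_calibration([0.2, 0.2, 0.8, 0.8], [0, 0, 1, 1]))
```

The first call describes a perfectly calibrated score (half of the points scored 0.5 are positive); the second describes a score that is underconfident at both ends.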

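Note that:

$$\begin{aligned}
&\Big|\sup_{p_1,p_2} |\mathcal{F}_D(f) - \mathcal{E}_D(f)| - \sup_{p_1,p_2} |\mathcal{F}_{\mathcal{P}}(f) - \mathcal{E}_{\mathcal{P}}(f)|\Big| \\
&\le \sup_{p_1,p_2} \Big||\mathcal{F}_D(f) - \mathcal{E}_D(f)| - |\mathcal{F}_{\mathcal{P}}(f) - \mathcal{E}_{\mathcal{P}}(f)|\Big| \\
&\le \sup_{p_1,p_2} |\mathcal{F}_D(f) - \mathcal{E}_D(f) - \mathcal{F}_{\mathcal{P}}(f) + \mathcal{E}_{\mathcal{P}}(f)| \\
&\le \sup_{p_1,p_2} \big(|\mathcal{F}_D(f) - \mathcal{F}_{\mathcal{P}}(f)| + |\mathcal{E}_D(f) - \mathcal{E}_{\mathcal{P}}(f)|\big) \\
&\le \sup_{p_1,p_2} |\mathcal{F}_D(f) - \mathcal{F}_{\mathcal{P}}(f)| + \sup_{p_1,p_2} |\mathcal{E}_D(f) - \mathcal{E}_{\mathcal{P}}(f)|
\end{aligned}$$

Therefore it suffices to show that

$$P\Big(\sup_{f,p_1,p_2} |\mathcal{F}_D(f) - \mathcal{F}_{\mathcal{P}}(f)| + \sup_{f,p_1,p_2} |\mathcal{E}_D(f) - \mathcal{E}_{\mathcal{P}}(f)| > \epsilon\Big) < \delta$$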


where the last step is because $\tau_i = \sigma_i \max(z_i, y_i)$ is
uniformly distributed over $\{\pm 1\}$ independent of the
value of $y_i$.
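The two facts used in this step can be checked by brute-force enumeration. The sketch below assumes labels are coded as $\pm 1$ and that $\sigma_i$ and $z_i$ are independent uniform signs; the helper names are hypothetical. It verifies that $\mathbb{E}_{z}\max(z, y)$ equals the indicator of $y = 1$, and that $\sigma \max(z, y)$ takes the values $+1$ and $-1$ equally often for either label.

```python
from itertools import product

SIGNS = (-1, 1)


def indicator_via_max(y):
    """E over uniform z in {-1, +1} of max(z, y), for a label y in {-1, +1}."""
    return sum(max(z, y) for z in SIGNS) / len(SIGNS)


def tau_counts(y):
    """Count how often tau = sigma * max(z, y) equals -1 and +1."""
    counts = {-1: 0, 1: 0}
    for sigma, z in product(SIGNS, repeat=2):
        counts[sigma * max(z, y)] += 1
    return counts


for y in SIGNS:
    print(y, indicator_via_max(y), tau_counts(y))
```

Because the counts are the same for $y = 1$ and $y = -1$, replacing $\sigma_i \max(z_i, y_i)$ with a fresh Rademacher variable does not change the expectation.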

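For $R_D(\mathcal{H}_2)$, we have:

$$\begin{aligned}
R_D(\mathcal{H}_2) &= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i f(x_i)\,\mathbb{1}_{p_1 < f(x_i) \le p_2}\Big] \\
&= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \int_0^1 \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{t < f(x_i)}\,\mathbb{1}_{p_1 < f(x_i) \le p_2}\,dt\Big] \\
&\le \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n} \int_0^1 \Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{\max(p_1,t) < f(x_i) \le p_2}\Big]\,dt \\
&= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n} \int_0^1 \Big[\sup_{p_1' \ge t,\,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1' < f(x_i) \le p_2}\Big]\,dt \\
&\le \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n} \int_0^1 \Big[\sup_{p_1',p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1' < f(x_i) \le p_2}\Big]\,dt \\
&= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1 < f(x_i) \le p_2}\Big] \\
&= R_D(\mathcal{H})
\end{aligned}$$

where the second step is due to $f(x) = \int_0^1 \mathbb{1}_{t < f(x)}\,dt$, and the fourth step is just substituting $\max(p_1, t)$ with $p_1'$. Since there is no constraint on $p_1$, $p_1'$ can take any value greater than or equal to $t$. □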


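Define

$$\mathcal{H}_1 = \{\mathbb{1}_{p_1 < f(x) \le p_2,\ y = 1} : p_1, p_2 \in \mathbb{R},\ f \in \mathcal{F}\}$$

$$\mathcal{H}_2 = \{f(x)\,\mathbb{1}_{p_1 < f(x) \le p_2} : p_1, p_2 \in \mathbb{R},\ f \in \mathcal{F}\}$$

Then we have the following lemma:

Lemma 2. Let $\mathcal{H}_1, \mathcal{H}_2$ be as defined above, then:

$$R_D(\mathcal{H}_1) \le R_D(\mathcal{H}) \qquad R_D(\mathcal{H}_2) \le R_D(\mathcal{H})$$

Proof. For $\mathcal{H}_1$ we have:

$$R_D(\mathcal{H}_1) = \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1 < f(x_i) \le p_2,\ y_i = 1}\Big]$$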


Combining this lemma with the assumptions in the 
theorem: 

^DRDiH 2 ) + < "x 

\ n 2 

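By Equation (1):

$$P\Big(\sup_{f,p_1,p_2} |\mathcal{F}_D(f) - \mathcal{F}_{\mathcal{P}}(f)| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2}$$

$$P\Big(\sup_{f,p_1,p_2} |\mathcal{E}_D(f) - \mathcal{E}_{\mathcal{P}}(f)| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2}$$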


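$$\begin{aligned}
&= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1 < f(x_i) \le p_2}\,\mathbb{E}_{z_i \sim \{\pm 1\}} \max(z_i, y_i)\Big] \\
&\le \frac{1}{n}\,\mathbb{E}_{\sigma, z \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \mathbb{1}_{p_1 < f(x_i) \le p_2}\,\sigma_i \max(z_i, y_i)\Big] \\
&= \frac{1}{n}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^n}\Big[\sup_{p_1,p_2,f} \sum_{i=1}^{n} \sigma_i\,\mathbb{1}_{p_1 < f(x_i) \le p_2}\Big] \\
&= R_D(\mathcal{H})
\end{aligned}$$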


Proof of Claim 

Proof. For any $\sigma \in \{\pm 1\}^n$, we can find a vector $w$ such that for every $i$ we have $w^T x_i = \sigma_i$ (this is always possible since the number of equations $n$ is less than the dimensionality $d$). Let $w^* = B w / \|w\|_2$ so that $\|w^*\|_2 = B$, and let $a = A \|w\|_2 / B$ and $b = 0$. Then we have:

$$f(x_i) = \frac{1}{1 + \exp(a (w^*)^T x_i + b)} = \frac{1}{1 + \exp(A \sigma_i)}$$

Let $A \to -\infty$, then $f(x_i) \to \mathbb{1}_{\sigma_i = 1}$, and the conclusion of the claim follows easily. □

1. For $i = 0, \ldots, n$, compute $P_i = (i, S_i)$, where $S_i = \sum_{j \le i} \mathbb{1}_{y_j = 1}$.

2. Let $cv(P)$ be the convex hull of the set of points $\{P_i\}$.

3. For $i = 0, \ldots, n$, let $Z_i$ be the intersection of $cv(P)$ and the line $x = i$.

4. Compute $z_i = Z_i - Z_{i-1}$.

5. Let $g(f_0(x_i)) = z_i$, and extrapolate these points to get a continuous nondecreasing function $g$.

Algorithm 1: Isotonic Regression Calibration Algorithm (PAV Algorithm)
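The convex-hull formulation of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes 0/1 labels, distinct scores, that the points are processed in ascending order of $f_0(x_i)$, and that "convex hull" means the lower hull of the points $(i, S_i)$; it also omits step 5 (extending $g$ to a continuous function). The name `pav_calibrate` is hypothetical.

```python
def pav_calibrate(scores, labels):
    """Return z_i = g(f0(x_i)) for each input point, via the lower convex hull."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])

    # Step 1: cumulative counts S_0 = 0, S_i = positives among the first i points.
    S = [0]
    for i in order:
        S.append(S[-1] + (1 if labels[i] == 1 else 0))

    # Step 2: lower convex hull of the points (i, S_i), scanned left to right.
    hull = []
    for i in range(n + 1):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # Drop k if it lies on or above the segment from j to i.
            if (S[k] - S[j]) * (i - j) >= (S[i] - S[j]) * (k - j):
                hull.pop()
            else:
                break
        hull.append(i)

    # Steps 3-4: z_i = Z_i - Z_{i-1} is the slope of the hull segment covering i.
    z_sorted = [0.0] * n
    for j, k in zip(hull, hull[1:]):
        slope = (S[k] - S[j]) / (k - j)
        for t in range(j, k):
            z_sorted[t] = slope

    # Map the calibrated values back to the original input order.
    calibrated = [0.0] * n
    for rank, i in enumerate(order):
        calibrated[i] = z_sorted[rank]
    return calibrated


print(pav_calibrate([0.1, 0.4, 0.35, 0.8, 0.7], [0, 0, 1, 1, 1]))
```

In the example, the points scored 0.35 and 0.4 are out of order with respect to their labels, so they are pooled and both receive the value 0.5.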


The Hypothesis Class $\mathcal{H}$

In Theorem 2, $\mathcal{H}$ is the collection of binary classifiers obtained by thresholding the output of a fuzzy classifier in $\mathcal{F}$. For many hypothesis classes $\mathcal{F}$, the Rademacher complexity of $\mathcal{H}$ can be naturally bounded. For instance, if $\mathcal{F}$ is the class of $d$-dimensional generalized linear classifiers with monotone link function, then $\mathbb{E}_D R_D(\mathcal{H})$ can be bounded by $O(\sqrt{d \log n / n})$. We remark that $\mathcal{H}$ is different from the hypothesis class $\mathcal{H}_{p_1,p_2}$ where the thresholds are fixed in advance:

$$\mathcal{H}_{p_1,p_2} = \{\mathbb{1}_{p_1 < f(x) \le p_2} : f \in \mathcal{F}\}$$

In general, the gap between the Rademacher complexities of $\mathcal{H}$ and $\mathcal{H}_{p_1,p_2}$ can be arbitrarily large. The following example illustrates this point.

Example 2. Let $\mathcal{X} = \{1, \ldots, n\}$, and let $A_1, A_2, \ldots, A_{2^n}$
be a sequence of sets containing all subsets of $\mathcal{X}$. Let
$\mathcal{F}$ be the following hypothesis space:

X = {Mx) = ^ - : z G {1, 2,..., 2-}} 

Intuitively, P contains 2" classifiers, the ith classifier 
produces a output of either ^ or ^ depending 

on whether $x \in A_i$. One can easily verify that for any
$p_1, p_2$, the VC-dimension (Vapnik and Chervonenkis,
1971) of $\mathcal{H}_{p_1,p_2}$ is at most 2, but the VC-dimension of
$\mathcal{H}$ is $n$.

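However, if for any $x \in \mathcal{X}, f \in \mathcal{F}$, we have $f(x) \in P^*$ with $|P^*| < \infty$, then $R_D(\mathcal{H})$ can be bounded using the maximum VC-dimension of $\mathcal{H}_{p_1,p_2}$ and $\log |P^*|$:

Claim 4. If for any $f \in \mathcal{F}, x \in \mathcal{X}$, we have $f(x) \in P^*$ where $P^*$ is a finite set, and for all $p_1, p_2 \in \mathbb{R}$, the VC-dimension of the hypothesis space $\mathcal{H}_{p_1,p_2}$ is at most $d$, then for any sample $D$ of size $n$ with $n \ge d + 1$ we have:

$$R_D(\mathcal{H}) \le \sqrt{\frac{2d\,(\ln\frac{n}{d} + 1)}{n} + \frac{4 \ln(|P^*| + 1)}{n}}$$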

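Proof. By Massart's Lemma (Shalev-Shwartz and Ben-David, 2014), we have:

$$R_D(\mathcal{H}) \le \sqrt{\frac{2 \ln |\mathcal{H}(D)|}{n}}$$

where $\mathcal{H}(D)$ is the restriction of $\mathcal{H}$ to $D$. It suffices to show that

$$|\mathcal{H}(D)| \le (|P^*| + 1)^2 (en/d)^d$$

Note that

$$\mathcal{H}(D) = \bigcup_{p_1, p_2} \mathcal{H}_{p_1,p_2}(D)$$

Since $f(x)$ only takes finitely many possible values, we only need to consider values of $p_1, p_2$ in $P^* \cup \{-\infty\}$. Therefore by the union bound we have

$$|\mathcal{H}(D)| \le \sum_{p_1, p_2 \in P^* \cup \{-\infty\}} |\mathcal{H}_{p_1,p_2}(D)|$$

Since each $\mathcal{H}_{p_1,p_2}$ has VC-dimension at most $d$, by Sauer's Lemma (Shalev-Shwartz and Ben-David, 2014):

$$\forall p_1, p_2,\quad |\mathcal{H}_{p_1,p_2}(D)| \le (en/d)^d$$

Combining the last two inequalities, we get the desired result. □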
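To get a feel for the size of the resulting bound, the counting argument can be evaluated numerically. The sketch below is only an illustration under stated assumptions: it uses the form of Massart's Lemma for 0/1-valued vectors, $R \le \sqrt{2 \ln N / n}$, together with the count $(|P^*| + 1)^2 (en/d)^d$ from the proof; the function names and the sample values of $n$, $d$ and $|P^*|$ are hypothetical.

```python
import math


def growth_bound(n, d, num_values):
    """Upper bound (|P*| + 1)^2 * (e * n / d)^d on the size of H restricted to D."""
    return (num_values + 1) ** 2 * (math.e * n / d) ** d


def massart_bound(n, num_behaviours):
    """Massart-style bound sqrt(2 ln N / n) for a finite set of 0/1 vectors."""
    return math.sqrt(2.0 * math.log(num_behaviours) / n)


n, d, num_values = 1000, 5, 10
bound = massart_bound(n, growth_bound(n, d, num_values))
closed_form = math.sqrt((2 * d * (math.log(n / d) + 1) + 4 * math.log(num_values + 1)) / n)
print(bound, closed_form)
```

The two printed numbers agree, since $2 \ln\big((|P^*|+1)^2 (en/d)^d\big) = 4\ln(|P^*|+1) + 2d(\ln\frac{n}{d} + 1)$.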

Proof of Claim [s] 

Proof. For reference, the pseudo-code of the PAV algorithm for isotonic regression (Niculescu-Mizil and Caruana, 2005) can be found in Algorithm 1.

Let $z_i = g(f_0(x_i))$, then we can rewrite the objective function as:

$$\max_{a, b} \Big|\sum_{a < i \le b} (\mathbb{1}_{y_i = 1} - z_i)\Big|$$

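To prove that Algorithm 1 also minimizes this objective function, we first state the minimization problem as a linear program:

$$\begin{aligned}
\min\ & \xi_1 + \xi_2 \quad \text{s.t.}\quad \xi_1, \xi_2 \ge 0 \\
& 0 \le z_1 \le z_2 \le \cdots \le z_n \le 1 \\
& \forall 1 \le k \le n,\ \sum_{i \le k} z_i \ge \sum_{i \le k} \mathbb{1}_{y_i = 1} - n\xi_1 \\
& \forall 1 \le k \le n,\ \sum_{i \le k} z_i \le \sum_{i \le k} \mathbb{1}_{y_i = 1} + n\xi_2
\end{aligned}$$

Define $S_k = \sum_{i \le k} \mathbb{1}_{y_i = 1}$ and $Z_k = \sum_{i \le k} z_i$. Then we have the following constraints:

$$\begin{aligned}
& \forall 1 \le k \le n - 1,\quad Z_k - Z_{k-1} \le Z_{k+1} - Z_k \\
& \forall 1 \le k \le n,\quad S_k - n\xi_1 \le Z_k \le S_k + n\xi_2
\end{aligned}$$

Let $Z^*$ be the solution produced by Algorithm 1; it should be obvious that $Z^*_i \le S_i$ for all $i$. Therefore,

$$\xi_2^* = -\frac{1}{n} \min_i (S_i - Z^*_i) = 0 \qquad \xi_1^* = \frac{1}{n} \max_i (S_i - Z^*_i)$$

We need to prove that $\xi_1^* \le \xi_1 + \xi_2$ for every feasible solution $(Z, \xi)$. Suppose $\xi_1^* = \frac{1}{n}(S_i - Z^*_i)$ and $Z^*_i$ lies on the line segment $\{(j, S_j), (k, S_k)\}$. Then we have: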


In particular, when $p = 0.5$, the empirical accuracy
of the calibrated classifier is always greater
than or equal to the empirical accuracy of the original
classifier.

Proof. Throughout the proof, let $C$ be the convex hull computed in Algorithm 1:

$$C = \{(i_0 = 0, 0), (i_1, S_{i_1}), \ldots, (i_{m-1}, S_{i_{m-1}}), (i_m = n, S_n)\}$$

We will use the following notations:

$$z_i = g^*(f_0(x_i)) \qquad Z_k = \sum_{i=1}^{k} z_i \qquad S_k = \sum_{i=1}^{k} \mathbb{1}_{y_i = 1}$$


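$$S_i - n\xi_1^* = Z^*_i = \frac{i - j}{k - j} S_k + \frac{k - i}{k - j} S_j$$

Because of the convexity constraint on $Z$, it must satisfy the following inequality:

$$Z_i \le \frac{i - j}{k - j} Z_k + \frac{k - i}{k - j} Z_j$$

Computing the difference between these two, we get

$$Z_i - S_i + n\xi_1^* \le \frac{i - j}{k - j}(Z_k - S_k) + \frac{k - i}{k - j}(Z_j - S_j)$$

Substituting in

$$Z_i - S_i \ge -n\xi_1 \qquad Z_k - S_k \le n\xi_2 \qquad Z_j - S_j \le n\xi_2$$

we get

$$n\xi_1^* \le n\xi_1 + n\xi_2$$

which proves the optimality of $Z^*$. □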


1. For any $p_1, p_2$, let $l, r$ be such that:

$$l = \max_{k \le n,\ z_k \le p_1} k \qquad r = \max_{k \le n,\ z_k \le p_2} k$$

If no such $k$ exists, let $l$ or $r$ be 0 respectively. By Algorithm 1 we have

$$\forall i_j < k \le i_{j+1},\quad z_k = \frac{S_{i_{j+1}} - S_{i_j}}{i_{j+1} - i_j}$$

Thus we have $(l, S_l), (r, S_r) \in C$, $Z_l = S_l$, $Z_r = S_r$, and therefore

$$\sum_{i=1}^{n} \mathbb{1}_{p_1 < z_i \le p_2,\ y_i = 1} - \sum_{i=1}^{n} \mathbb{1}_{p_1 < z_i \le p_2}\, z_i = (S_r - S_l) - (Z_r - Z_l) = 0$$

which implies that $c_{emp}(g^* \circ f_0) = 0$.

2. Let $a = \max\{i : f_0(x_i) \le p\}$, $b = \max\{i : z_i \le p\}$, then we need to show that


Properties of Isotonic Regression 

We can prove several interesting properties of isotonic 
regression using Theorem 

Claim 5. Let $g^*$ be the calibrating function produced by Algorithm 1, then:

1. The empirical calibration measure $c_{emp}(g^* \circ f_0, D)$ of the calibrated classifier is always 0.

2. For any asymmetric loss $(1 - p, p)$ (i.e., each false negative incurs $1 - p$ cost and each false positive incurs $p$ cost), the empirical loss of the calibrated classifier is always no greater than that of the original classifier (both using the optimal decision threshold $p$):

$$\sum_{i=1}^{n} \big[(1 - p)\,\mathbb{1}_{g^*(f_0(x_i)) \le p,\ y_i = 1} + p\,\mathbb{1}_{g^*(f_0(x_i)) > p,\ y_i = 0}\big] \le \sum_{i=1}^{n} \big[(1 - p)\,\mathbb{1}_{f_0(x_i) \le p,\ y_i = 1} + p\,\mathbb{1}_{f_0(x_i) > p,\ y_i = 0}\big]$$
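Both properties can be checked numerically on a toy sample. The sketch below is an illustration under stated assumptions: labels are 0/1, the scores are distinct and already sorted in ascending order, and the PAV fit is computed with the usual block-merging formulation rather than the convex-hull form of Algorithm 1 (the two produce the same fitted values). All function names and the data are hypothetical.

```python
def pav(labels_sorted):
    """Pool adjacent violators on 0/1 labels sorted by ascending score."""
    blocks = []  # each block holds [sum of labels, number of points]
    for y in labels_sorted:
        blocks.append([y, 1])
        # Merge while the previous block has a larger mean than the last one.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return [total / count for total, count in blocks for _ in range(count)]


def c_emp(values, labels):
    """Empirical calibration measure over half-open intervals (p1, p2]."""
    n = len(values)
    pairs = sorted(zip(values, labels))
    prefix, boundaries = 0.0, [0.0]
    for idx, (v, y) in enumerate(pairs):
        prefix += y - v
        if idx == n - 1 or pairs[idx + 1][0] != v:
            boundaries.append(prefix)
    return (max(boundaries) - min(boundaries)) / n


def asymmetric_loss(values, labels, p):
    """False negatives (value <= p, y = 1) cost 1 - p; false positives cost p."""
    loss = 0.0
    for v, y in zip(values, labels):
        if v <= p and y == 1:
            loss += 1 - p
        elif v > p and y == 0:
            loss += p
    return loss


scores = [0.1, 0.3, 0.45, 0.6, 0.9]   # already sorted, distinct
labels = [0, 1, 1, 0, 1]
calibrated = pav(labels)
print(calibrated)
print(c_emp(calibrated, labels))
print(asymmetric_loss(calibrated, labels, 0.5), asymmetric_loss(scores, labels, 0.5))
```

On this sample the calibrated scores have an empirical calibration measure of zero up to floating-point rounding, and the loss at $p = 0.5$ does not exceed the loss of the original scores.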


$$\sum_{i=1}^{b} (1 - p)\,\mathbb{1}_{y_i = 1} + p \sum_{i=b+1}^{n} \mathbb{1}_{y_i = 0} \le \sum_{i=1}^{a} (1 - p)\,\mathbb{1}_{y_i = 1} + p \sum_{i=a+1}^{n} \mathbb{1}_{y_i = 0}$$

We consider two separate cases:

(a) $a < b$: in this case we only need to show that

$$\sum_{i=a+1}^{b} \big[p\,\mathbb{1}_{y_i = 0} - (1 - p)\,\mathbb{1}_{y_i = 1}\big] \ge 0$$

or equivalently,

$$p\,[(b - a) - (S_b - S_a)] - (1 - p)(S_b - S_a) \ge 0$$

Rearranging terms, it suffices to show

$$p(b - a) - (S_b - S_a) \ge 0$$

Since $S_b = Z_b$ and $S_a \ge Z_a$,

$$S_b - S_a \le Z_b - Z_a \le z_b (b - a) \le p(b - a)$$

(b) $a > b$: in this case we only need to show

$$\sum_{i=b+1}^{a} \big[p\,\mathbb{1}_{y_i = 0} - (1 - p)\,\mathbb{1}_{y_i = 1}\big] \le 0$$

or equivalently,

$$p\,[(a - b) - (S_a - S_b)] - (1 - p)(S_a - S_b) \le 0$$

Rearranging terms, it suffices to show

$$p(a - b) - (S_a - S_b) \le 0$$

Since $S_b = Z_b$ and $S_a \ge Z_a$,

$$S_a - S_b \ge Z_a - Z_b \ge z_{b+1}(a - b) \ge p(a - b)$$

□

□ 

We can also use Theorem 2 to derive the following non-asymptotic convergence result for Algorithm 1.

Claim 6. Let $F(t) = \mathcal{P}(f_0(X) \le t)$ be the distribution function of $f_0(X)$, and define $G(t)$ as:

$$G(t) = \mathcal{P}(f_0(X) \le t,\ Y = 1)$$

Let $cv : [0, 1] \to [0, 1]$ be the convex hull of all points $(F(t), G(t))$ for all $t \in [0, 1]$. Define $G_e$ as:

$$G_e(t) = \mathbb{E}\big[\mathbb{1}_{f_0(X) \le t}\, g^*(f_0(X))\big]$$

Then under the same condition as in Theorem 2,

$$P\Big(\sup_t |G_e(t) - cv(F(t))| > 2\epsilon\Big) < 5\delta$$

In particular, if $\mathcal{P}(Y = 1 \mid f_0(X))$ is monotonically increasing, then

$$P\Big(\sup_t |G_e(t) - G(t)| > 2\epsilon\Big) < 5\delta$$

Let us explain the intuition behind this claim: $F(t)$ is the percentage of data points satisfying $f_0(X) \le t$, and $G(t)$ is $F(t)$ times the conditional probability of $Y = 1$ in the region $\{f_0(X) \le t\}$. Now consider the points $P_i = (i, S_i)$ in Algorithm 1: it is not hard to show that as $n \to \infty$, the limit of the points $P_i$ is the curve $(F(t), G(t)), t \in [0, 1]$ (after proper scaling). Similarly, $G_e(t)$ is $F(t)$ times the expected value of $g^*(f_0(X))$ in the region $\{f_0(X) \le t\}$, and it is not hard to show that $(F(t), G_e(t))$ is the limit of $(i, Z_i)$ (after proper scaling). The claim states that in the PAV algorithm, $(F(t), G_e(t))$ converges uniformly to the convex hull of $(F(t), G(t))$, which should not be surprising, since we explicitly computed the convex hull of $\{P_i\}$ in Algorithm 1.


When $\mathcal{P}(Y = 1 \mid f_0(X))$ is monotonically increasing w.r.t. $f_0(X)$, $(F(t), G(t))$ is convex, and Claim 6 immediately implies that $G_e(t)$ will converge uniformly to $G(t)$. In this case, the PAV algorithm will eventually recover the "true" link function $g^*(f_0(X)) = \mathcal{P}(Y = 1 \mid f_0(X))$ given sufficient training samples, and Claim 6 provides a rough estimate of the number of samples required to achieve the desired precision.

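Proof. Throughout the proof, let $C$ be the convex hull computed in Algorithm 1:

$$C = \{(i_0 = 0, 0), (i_1, S_{i_1}), \ldots, (i_{m-1}, S_{i_{m-1}}), (i_m = n, S_n)\}$$

We will use the following notations:

$$z_i = g^*(f_0(x_i)) \qquad Z_k = \sum_{i=1}^{k} z_i \qquad S_k = \sum_{i=1}^{k} \mathbb{1}_{y_i = 1}$$

We will use the following facts from the proof of Theorem 2:

$$P\Big(\sup_{g,p_1,p_2} |\mathcal{F}_D(g \circ f_0) - \mathcal{F}_{\mathcal{P}}(g \circ f_0)| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2}$$

$$P\Big(\sup_{g,p_1,p_2} |\mathcal{E}_D(g \circ f_0) - \mathcal{E}_{\mathcal{P}}(g \circ f_0)| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2}$$

For any $t \in [0, 1]$, let $g'$ be any continuous increasing function from $[0, 1]$ to $[0, 1]$. Let $k = \max\{i : f_0(x_i) \le t\}$, $p_1 = -\infty$, $p_2 = g'(t)$ in the above inequalities, then we have:

$$P\Big(\Big|\frac{1}{n} S_k - G(t)\Big| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2} \qquad (2)$$

$$P\Big(\Big|\frac{1}{n} \sum_{i=1}^{k} g'(f_0(x_i)) - \mathbb{E}\big[\mathbb{1}_{f_0(X) \le t}\, g'(f_0(X))\big]\Big| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2}$$

Let $g'$ be such that $\|g' - g^*\|_\infty < \lambda$, where $\lambda > 0$ can be arbitrarily small. Let $\lambda \downarrow 0$, then the second inequality implies

$$P\Big(\Big|\frac{1}{n} Z_k - G_e(t)\Big| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2} \qquad (3)$$

Let $g'$ be such that $|g'(x) - 1| < \lambda$ for any $x$. Let $\lambda \downarrow 0$, then the second inequality implies

$$P\Big(\Big|\frac{k}{n} - F(t)\Big| > \frac{\epsilon}{2}\Big) < \frac{\delta}{2} \qquad (4)$$

For any $t \in [0, 1]$, let $k = \max\{i : f_0(x_i) \le t\}$. Let $[i_{j-1} = l,\ i_j = r]$ be the segment of $C$ with $l < k \le r$. Then we have

$$S_l = Z_l = Z_k - (k - l) z_k$$

$$S_r = Z_r = Z_k + (r - k) z_k$$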

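On the other hand, by (2), with probability at least $1 - \delta$:

$$\frac{1}{n} S_l \ge G(f_0(x_l)) - \frac{\epsilon}{2} \qquad \frac{1}{n} S_r \ge G(f_0(x_r)) - \frac{\epsilon}{2}$$

Since $cv$ is the convex hull of $(F(t), G(t))$, we have

$$q\,G(f_0(x_l)) + (1 - q)\,G(f_0(x_r)) \ge cv(F(t))$$

where $q = \frac{F(f_0(x_r)) - F(t)}{F(f_0(x_r)) - F(f_0(x_l))}$. Combining all, with probability at least $1 - \delta$:

$$\frac{1}{n} Z_k + \frac{1}{n}\,[ql + (1 - q)r - k]\, z_k + \frac{\epsilon}{2} \ge cv(F(t))$$

By (4), with probability at least $1 - \frac{3}{2}\delta$:

$$\frac{1}{n} l \le F(f_0(x_l)) + \frac{\epsilon}{2} \qquad \frac{1}{n} r \le F(f_0(x_r)) + \frac{\epsilon}{2} \qquad \frac{k}{n} \ge F(t) - \frac{\epsilon}{2}$$

Therefore, we have with probability at least $1 - \frac{5}{2}\delta$,

$$\frac{1}{n} Z_k + \frac{3\epsilon}{2} \ge cv(F(t))$$

Then by (3), with probability at least $1 - 3\delta$,

$$G_e(t) + 2\epsilon \ge cv(F(t))$$

Conversely, suppose $(F(t), cv(F(t)))$ is on the line segment between $(F(a), G(a))$ and $(F(b), G(b))$, then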

Discussion on Kakade’s Algorithm (2011)

Kakade’s algorithm minimizes the following squared loss objective function:

$$\mathcal{L}(u, w) = \sum_{i=1}^{n} \big(y_i - u(w \cdot x_i)\big)^2$$

where $u$ is a non-decreasing 1-Lipschitz function and $w$ satisfies $\|w\| \le W$. In each iteration, the algorithm first fixes $u$ and searches for the optimal $w$ that minimizes the squared loss, then fixes $w$ and runs a slightly modified version of the PAV algorithm (Algorithm 1) to find the optimal $u$.

In Claim 5 we proved that the PAV algorithm always produces a calibrated classifier; therefore Kakade’s algorithm can be viewed as alternating between the following two steps:

1. Search for the parameter $w$ that minimizes the squared loss $\mathcal{L}(u, w)$.

2. Find the link function $u$ such that $u(w \cdot x)$ is empirically calibrated.

In other words, each iteration of Kakade’s algorithm can be viewed as first optimizing the objective function $\mathcal{L}(u, w)$, then projecting $u(w \cdot x)$ onto the space of empirically calibrated classifiers. An interesting question here is whether the algorithm would still work if we replaced the squared loss function with any other loss function in the first step.


$$G(a) = cv(F(t)) - w\,(F(t) - F(a))$$

$$G(b) = cv(F(t)) + w\,(F(b) - F(t))$$

where $w = \frac{G(b) - G(a)}{F(b) - F(a)}$ (if $F(a) = F(b)$ then just let $w = 1$).

By (2) and (3) and the fact that $S_k \ge Z_k$, with probability at least $1 - 2\delta$:

$$G(a) + \epsilon \ge G_e(a) \qquad G(b) + \epsilon \ge G_e(b)$$

Also since $(F(t), G_e(t))$ is convex, we have:

$$q\,G_e(a) + (1 - q)\,G_e(b) \ge G_e(t)$$

where $q = \frac{F(b) - F(t)}{F(b) - F(a)}$. Combining all of the above, with probability at least $1 - 2\delta$:

$$cv(F(t)) + \epsilon \ge G_e(t)$$

Combining the two directions, the proof is complete. □






